Historic information of Crimes in a city can be used to prevent crimes similar nature in future. Many city bodies are already doing this. In this data analysis project, I will explore the City of Vancouver’s crime data(2003 - 2017) and see if we can see any trends in crime occurrences which in turn can help in predicting future crimes.
Crime data is made public by City of Vancouver on their website. The dataset used here has following attributes
More deails about the data set can be found at open data catalog of the City of Vancouver https://data.vancouver.ca/datacatalogue/crime-data.htm
The downloaded csv data from City’s catalog and other libraries used are loaded
##loading data
crime_data <- read.csv("crime_csv_all_years.csv", header = TRUE)
##loading reqlibraries
## for projecting long and latitude values on map
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate) ## for timestamp info
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(ggplot2)
library(repr)
library(rgdal)
## Loading required package: sp
## rgdal: version: 1.3-6, (SVN revision 773)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
## Path to GDAL shared files: C:/Users/I853328/Documents/R/win-library/3.5/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
## Path to PROJ.4 shared files: C:/Users/I853328/Documents/R/win-library/3.5/rgdal/proj
## Linking to sp version: 1.3-1
library(proj4)
##
## Attaching package: 'proj4'
## The following object is masked from 'package:rgdal':
##
## project
library(mapview)
Basic summary functions on dataset
head(crime_data)
## TYPE YEAR MONTH DAY HOUR MINUTE
## 1 Theft from Vehicle 2003 3 25 8 0
## 2 Theft from Vehicle 2003 9 12 16 0
## 3 Theft from Vehicle 2003 6 8 20 0
## 4 Theft from Vehicle 2003 4 9 8 0
## 5 Break and Enter Residential/Other 2003 7 29 10 0
## 6 Theft from Vehicle 2003 5 14 1 0
## HUNDRED_BLOCK NEIGHBOURHOOD X Y
## 1 6XX ROBSON ST Central Business District 491356.6 5458679
## 2 11XX PENDRELL ST West End 490467.5 5458774
## 3 11XX PENDRELL ST West End 490490.7 5458763
## 4 6XX ROBSON ST Central Business District 491356.6 5458679
## 5 7XX VICTORIA DR Grandview-Woodland 495232.4 5458391
## 6 34XX KINGSWAY AVE Killarney 497769.2 5453240
tail(crime_data)
## TYPE YEAR MONTH DAY HOUR MINUTE HUNDRED_BLOCK
## 580916 Theft from Vehicle 2018 2 2 23 30 9XX HOWE ST
## 580917 Theft from Vehicle 2018 1 11 7 50 6XX W 41ST AVE
## 580918 Theft from Vehicle 2018 1 5 16 0 6XX W 41ST AVE
## 580919 Theft from Vehicle 2018 5 16 20 30 29XX W BROADWAY AVE
## 580920 Other Theft 2018 1 20 13 16 7XX GRANVILLE ST
## 580921 Theft of Bicycle 2018 7 6 12 0 6XX W CORDOVA ST
## NEIGHBOURHOOD X Y
## 580916 Central Business District 491094.9 5458693
## 580917 Oakridge 491253.2 5453430
## 580918 Oakridge 491253.2 5453430
## 580919 Kitsilano 487502.5 5456838
## 580920 Central Business District 491292.8 5458742
## 580921 Central Business District 491860.3 5459180
str(crime_data)
## 'data.frame': 580921 obs. of 10 variables:
## $ TYPE : Factor w/ 11 levels "Break and Enter Commercial",..: 7 7 7 7 2 7 4 7 7 7 ...
## $ YEAR : int 2003 2003 2003 2003 2003 2003 2003 2003 2003 2003 ...
## $ MONTH : int 3 9 6 4 7 5 6 2 2 7 ...
## $ DAY : int 25 12 8 9 29 14 28 2 27 1 ...
## $ HOUR : int 8 16 20 8 10 1 11 18 18 1 ...
## $ MINUTE : int 0 0 0 0 0 0 28 0 0 0 ...
## $ HUNDRED_BLOCK: Factor w/ 21713 levels ""," / 3888 W 50TH AVE",..: 12515 510 510 12515 13500 7132 13145 14207 510 510 ...
## $ NEIGHBOURHOOD: Factor w/ 25 levels "","Arbutus Ridge",..: 3 24 24 3 6 10 22 3 24 24 ...
## $ X : num 491357 490468 490491 491357 495232 ...
## $ Y : num 5458679 5458774 5458763 5458679 5458391 ...
summary(crime_data)
## TYPE YEAR
## Theft from Vehicle :191704 Min. :2003
## Mischief : 77960 1st Qu.:2006
## Break and Enter Residential/Other: 64022 Median :2009
## Other Theft : 58982 Mean :2010
## Offence Against a Person : 58301 3rd Qu.:2014
## Theft of Vehicle : 40156 Max. :2018
## (Other) : 89796
## MONTH DAY HOUR MINUTE
## Min. : 1.000 Min. : 1.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 9.00 1st Qu.: 0.00
## Median : 7.000 Median :15.00 Median :15.00 Median :10.00
## Mean : 6.502 Mean :15.41 Mean :13.72 Mean :17.03
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:19.00 3rd Qu.:30.00
## Max. :12.000 Max. :31.00 Max. :23.00 Max. :59.00
## NA's :58540 NA's :58540
## HUNDRED_BLOCK NEIGHBOURHOOD
## OFFSET TO PROTECT PRIVACY: 58540 Central Business District:124888
## 7XX GRANVILLE ST : 5044 : 60924
## 6XX GRANVILLE ST : 3306 West End : 45219
## 7XX W GEORGIA ST : 2784 Fairview : 34631
## X NK_LOC ST : 2364 Mount Pleasant : 33747
## 3XX ABBOTT ST : 2162 Grandview-Woodland : 29686
## (Other) :506721 (Other) :251826
## X Y
## Min. : 0 Min. : 0
## 1st Qu.:489982 1st Qu.:5453718
## Median :491520 Median :5456877
## Mean :442626 Mean :4906817
## 3rd Qu.:493512 3rd Qu.:5458661
## Max. :511303 Max. :5512579
##
Some data cleaning and prepartion is required before any analysis.
Later during exploration, I will be plotting Crimes of type theft of/from vehicles on actual city’s map so conversion of X and Y coordinates to latitude and longitude values is required. Following chunk of code does that and appends corresponding columns to dataset replacing X and Y coordinates.
## converting x y corrdinates to latitude and longitude respectively'
proj4string<-"+proj=utm +zone=10 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"
pj <- project(crime_data[, c('X', 'Y')],proj4string, inverse = TRUE)
##Appending columns to data set
crime_data<-cbind(crime_data,pj)
## changing colnames to latitude and longitude respectively for appeneded columns
colnames(crime_data)[11] <-"lon"
colnames(crime_data)[12] <-"lat"
## removing x and y coordinates
crime_data <- select(crime_data, -c('X','Y'))
On exploring dataset, it is found that location information is not revealed and kept private for certain crime categories. So I am setting the neighbourhood of those entries as “Not revealed”
head((crime_data %>% filter(NEIGHBOURHOOD=="")),5)
## TYPE YEAR MONTH DAY HOUR MINUTE
## 1 Offence Against a Person 2003 12 9 NA NA
## 2 Offence Against a Person 2003 3 9 NA NA
## 3 Offence Against a Person 2003 2 23 NA NA
## 4 Offence Against a Person 2003 8 30 NA NA
## 5 Offence Against a Person 2003 8 12 NA NA
## HUNDRED_BLOCK NEIGHBOURHOOD lon lat
## 1 OFFSET TO PROTECT PRIVACY -127.4887 0
## 2 OFFSET TO PROTECT PRIVACY -127.4887 0
## 3 OFFSET TO PROTECT PRIVACY -127.4887 0
## 4 OFFSET TO PROTECT PRIVACY -127.4887 0
## 5 OFFSET TO PROTECT PRIVACY -127.4887 0
levels(crime_data$NEIGHBOURHOOD)[1] <- "Not revealed"
As we have latitude/longitude and neighbourhood information, hundred’s block information is removed from dataset.
crime_data <- select(crime_data, -c("HUNDRED_BLOCK"))
The time information in dataset is scattered in YEAR, MONTH, HOUR, MINUTE,SECOND columns. Combining it into a single TIMESTAMP column using lubridate package. This will be quite helpful in later analyzing crime data based on timestamp.
##mutating timestamp column
crime_data<- crime_data %>% mutate(TIMESTAMP =make_datetime(YEAR, MONTH, DAY, HOUR,MINUTE))
head(crime_data)
## TYPE YEAR MONTH DAY HOUR MINUTE
## 1 Theft from Vehicle 2003 3 25 8 0
## 2 Theft from Vehicle 2003 9 12 16 0
## 3 Theft from Vehicle 2003 6 8 20 0
## 4 Theft from Vehicle 2003 4 9 8 0
## 5 Break and Enter Residential/Other 2003 7 29 10 0
## 6 Theft from Vehicle 2003 5 14 1 0
## NEIGHBOURHOOD lon lat TIMESTAMP
## 1 Central Business District -123.1188 49.28081 2003-03-25 08:00:00
## 2 West End -123.1311 49.28164 2003-09-12 16:00:00
## 3 West End -123.1307 49.28155 2003-06-08 20:00:00
## 4 Central Business District -123.1188 49.28081 2003-04-09 08:00:00
## 5 Grandview-Woodland -123.0655 49.27826 2003-07-29 10:00:00
## 6 Killarney -123.0306 49.23193 2003-05-14 01:00:00
Reordering columns in dataset for better readability.
## reordering columns to put date forward and removing minutes column as we are not analyzing data at minute level
crime_data <- crime_data[ ,c(10,2,3,4,5, 8,9,7,1) ]
## filtering data upto 2017 as 2018 is not over yet
crime_data<- crime_data %>% filter(YEAR<2018)
This is how final dataset looks like after the preparation steps.
head(crime_data)
## TIMESTAMP YEAR MONTH DAY HOUR lon lat
## 1 2003-03-25 08:00:00 2003 3 25 8 -123.1188 49.28081
## 2 2003-09-12 16:00:00 2003 9 12 16 -123.1311 49.28164
## 3 2003-06-08 20:00:00 2003 6 8 20 -123.1307 49.28155
## 4 2003-04-09 08:00:00 2003 4 9 8 -123.1188 49.28081
## 5 2003-07-29 10:00:00 2003 7 29 10 -123.0655 49.27826
## 6 2003-05-14 01:00:00 2003 5 14 1 -123.0306 49.23193
## NEIGHBOURHOOD TYPE
## 1 Central Business District Theft from Vehicle
## 2 West End Theft from Vehicle
## 3 West End Theft from Vehicle
## 4 Central Business District Theft from Vehicle
## 5 Grandview-Woodland Break and Enter Residential/Other
## 6 Killarney Theft from Vehicle
The following plot shows the yearly crime numbers in City of Vancouver.
options(repr.plot.width = 6, repr.plot.height = 7)
bar_chart_years<-
ggplot(crime_data,aes(x= YEAR))+
geom_histogram(binwidth = (max(crime_data$YEAR)-min(crime_data$YEAR))/28, fill ="royalblue3" )+
labs(y = "Number of crimes", x= "Year", title ="Crimes in Vancouver over last few years")+
theme(axis.text.x = element_text(angle = 50, vjust = 0.5),
panel.background = element_rect(fill = "lavender"))+
scale_x_continuous(breaks = unique(crime_data$YEAR))
bar_chart_years
The crime numbers show a dowwnward trent till 2012-2013. However, after 2013 there is upward trend.
The following plot shows crimes in all years based on crime types.
bar_chart_types <- ggplot(crime_data, aes(TYPE))+
geom_bar(fill = "royalblue3")+
labs(x="Crime Type", y="Number of crimes", title = "Crime categories")+
theme(axis.text.x = element_text(angle = 90,vjust = 0.5), panel.background = element_rect(fill = "lavender"))
bar_chart_types
Clearly Theft from Vehicle accounts to the maximum number of crimes reported by a big margin in last 15 years. While Crimes for categories Homicide and Vehicle Collision or Pedestrian Struck (with Fatality) accounts for the minimum numbers.
Also categories “Vehicle Collision or Pedestrian Struck (with Fatality)” and “Vehicle Collision or Pedestrian Struck (with Injury)” have small number and are closely related. So clubbing them into one single category(Veh. collision/Pedestrian struck(Injury & Fatality)) for better analysis.
Lastly, clubbing togethe “Theft of Vehicle” and “Theft of Bicycle” into a single crime category Theft of Vechicle/Bicycle Following is the R code which does that.
crime_data$TYPE<- as.character(crime_data$TYPE)
crime_data$TYPE<-sapply(crime_data$TYPE, function(x){ifelse(x =="Vehicle Collision or Pedestrian Struck (with Fatality)"|x=="Vehicle Collision or Pedestrian Struck (with Injury)",
"Veh. collision/Pedestrian struck(Injury & Fatality)",x )})
crime_data$TYPE<-sapply(crime_data$TYPE, function(x){ifelse(x =="Theft of Vehicle"|x=="Theft of Bicycle",
"Theft of Vechicle/Bicycle",x )})
crime_data$TYPE <-as.factor(crime_data$TYPE)
Plotting crime by categories again with modified labels.
bar_chart_types <- ggplot(crime_data, aes(TYPE))+
geom_bar(fill = "royalblue3")+
labs(x="Crime Type", y="Number of crimes", title = "Vancouver Crime numbers by Crime Types")+
theme(axis.text.x = element_text(angle = 90,vjust = 0.5), panel.background = element_rect(fill = "lavender"))
bar_chart_types
#### Vancouver Crime by location
Following code provides information about the crime based on neighbourhood location in city of Vancouver
bar_chart_loc <- ggplot(crime_data, aes(NEIGHBOURHOOD))+
geom_bar(fill = "royalblue3")+
labs(x="Region", y="Number of crimes", title = "Vancouver Crime numbers by Region")+
theme(axis.text.x = element_text(angle = 90,vjust = 0.5), panel.background = element_rect(fill = "lavender"))
bar_chart_loc
Clearly the crime numbers indicate that maximum number of crimes are reported from Central Business District while the minmum numbers came from Musqueam. This could be due to how densely populated is Downtown(Central Business District) area and other factors for which city of Vancouver’s census information is required.
So far we have checked crime numbers for different years, locations and types. It is quite imperative to see how all these numbers work together.
Following plot shows detailed information about yearly(2010-2017) crime numbers for each category and for different locations.
plot_all <- ggplot(crime_data %>% filter(YEAR>2009), aes(NEIGHBOURHOOD, fill = TYPE))+
geom_bar()+
labs(x="Region", y="Number of crimes", title = "Vancouver yearly crime numbers")+
theme(axis.text.x = element_text(angle = 90,vjust = 0.5),
panel.background = element_rect(fill = "lavender"),
legend.text = element_text(size = 7), legend.title = element_blank(), legend.position = "bottom")+
facet_wrap(~YEAR, ncol =2)
ggsave("plot_all.png", width =16, height = 9)
plot_all
The plot shows quite useful information. * Yearly crime numbers and trend is similar for each category and location * The plot clearly shows that neighbourhood information for crimes against a person is kept private and not revealed by city of Vancouver. * The trend of crimes overall for all years and for each individual year is similar too. For instance, maximum crime numbers for 2010-17 comes from Central Business District while minimum comes from Stanley park and Musqueam which is true overall for all years as well.
Checking crime numbers overall for all years for each crime type and location.
plot_all2 <- ggplot(crime_data, aes(NEIGHBOURHOOD, fill = TYPE))+
geom_bar()+
labs(x="Region", y="Number of crimes", title = "Vancouver yearly crime numbers")+
theme(axis.text.x = element_text(angle = 90,vjust = 0.5),
panel.background = element_rect(fill = "lavender"),
legend.text = element_text(size = 7), legend.title = element_blank(), legend.position = "bottom")
plot_all2
The plot clearly indicates that overall trend for all years is quite similar to individual years’ crime distribution for locations and crime types.
It will be interesting to see which month of a year, day of the week and hour of the day crimes were reported. Following section discusses relationship of crime numbers with timestamp.
Following R code provides crime statistics on monthly basis for last 15 years
bar_chart_monthly_all_years <- ggplot(filter(crime_data, !is.na(TIMESTAMP) ), aes(month(TIMESTAMP, label=TRUE), fill = TYPE))+
geom_bar()+
labs(x="Month", y="Number of crimes", title = "Monthly Crime numbers for all years")+
theme(axis.text.x = element_text(angle = 90,vjust = 0.5),
panel.background = element_rect(fill = "lavender"),
legend.position = "bottom")
bar_chart_monthly_all_years
Let check if trend is similar is similar for last 4 years.
## for individual years to see if trend is similar
bar_chart_monthly_last4_years <- ggplot(filter(crime_data, !is.na(TIMESTAMP), YEAR>2013 ), aes(month(TIMESTAMP, label=TRUE), fill = TYPE))+
geom_bar()+
labs(x="Month", y="Number of crimes", title = "Monthly Crime numbers last 4 years")+
theme(axis.text.x = element_text(angle = 90,vjust = 0.5),
panel.background = element_rect(fill = "lavender"),
legend.position = "bottom")+
facet_wrap(~YEAR,ncol=2)
bar_chart_monthly_last4_years
Clearly the numbers are similar for each year. Intersting observation is Winter months(especially February) has lowest number of crimes reported while fall months have compartvely more crimes reported.
Now checking crime statistics for day of a weeek for all years.
##day wise crime data all years
bar_chart_daily_all_years <- ggplot(filter(crime_data, !is.na(TIMESTAMP) ), aes(wday(TIMESTAMP, label=TRUE), fill = TYPE))+
geom_bar()+
labs(x="Day", y="Number of crimes", title = "Daily Crime numbers for all years")+
theme(axis.text.x = element_text(angle = 90,vjust = 0.5),
panel.background = element_rect(fill = "lavender"),
legend.position = "bottom")
bar_chart_daily_all_years
Comparing the trend with last 4 years.
bar_chart_daily_last4_years <- ggplot(filter(crime_data, !is.na(TIMESTAMP), YEAR>2013 ), aes(wday(TIMESTAMP, label=TRUE), fill = TYPE))+
geom_bar()+
labs(x="Day", y="Number of crimes", title = "Daily Crime numbers last 4 years")+
theme(axis.text.x = element_text(angle = 90,vjust = 0.5),
panel.background = element_rect(fill = "lavender"),
legend.position = "bottom")+
facet_wrap(~YEAR,ncol=2)
bar_chart_daily_last4_years
The day wise overall trends for all years and for last 4 years is similar. * The maximum number of crimes are reported for weekends(maximum on Fridays) * The minimum number of crimes are reported for Wednesdays and Thursdays. * Another interesting observation is the crime numbers are more for purple and blue regions over weekends which is theft of and theft from vehicles.
Now lets check crime numbers for particular hour of the day
## hour of the day crimes
bar_chart_hourly_all_years <- ggplot(filter(crime_data, !is.na(TIMESTAMP) ), aes(hour(TIMESTAMP), fill = TYPE))+
geom_bar()+
labs(x="Hour of the day", y="Number of crimes", title = "Hourly Crime numbers for all years")+
theme(axis.text.x = element_text(angle = 90,vjust = 0.5),
panel.background = element_rect(fill = "lavender"),
legend.position = "bottom")+
scale_x_continuous(breaks = unique(hour(crime_data$TIMESTAMP)))
bar_chart_hourly_all_years
Comparing times for last 4 years
bar_chart_hourly_last4_years <- ggplot(filter(crime_data, !is.na(TIMESTAMP), YEAR>2013 ), aes(hour(TIMESTAMP), fill = TYPE))+
geom_bar()+
labs(x="Hour of the day", y="Number of crimes", title = "Daily Crime numbers last 4 years")+
theme(axis.text.x = element_text(angle = 90,vjust = 0.5),
panel.background = element_rect(fill = "lavender"),
legend.position = "bottom")+
facet_wrap(~YEAR,ncol=2)
bar_chart_hourly_last4_years
Hourly overall trends for all years is also similar ot the overall trends for individual years. * Maximum number of crimes are reported for late evening(around 18:00 hours) * Early morning hours 02:00- 05:00 have less number of crimes reported
As there are 549833 observations for last 15 years crime data, It will not be practically feasible to analyze locations of all crime points on map.So using mapview package to plot Theft of vehicle and Theft from Vehicle for last two years to see the location of crime reported based on latitude and longitude values calculated earlier.
## loading shape file for last two years
yvr_shp1 <- readOGR(".", "crime_shp_2016")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\I853328\OneDrive - SAP SE\Data science\Github Projects\vancouverCrimeDataExploration", layer: "crime_shp_2016"
## with 34634 features
## It has 8 fields
yvr_shp2 <- readOGR(".", "crime_shp_2017")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\I853328\OneDrive - SAP SE\Data science\Github Projects\vancouverCrimeDataExploration", layer: "crime_shp_2017"
## with 33745 features
## It has 8 fields
## combining two shape files
yvr_shp<- rbind(yvr_shp1,yvr_shp2)
##converting into data frame to perform filtering operations
yvr_shp_df <- as.data.frame(yvr_shp)
## filtering for theft of and theft from vechicle data for last two years
yvr_shp_df_2016 <- yvr_shp_df %>% filter( (TYPE =="Theft from Vehicle"| TYPE == "Theft of Bicycle"| TYPE =="Theft of Vehicle"),YEAR ==2016)
yvr_shp_df_2017 <- yvr_shp_df %>% filter( (TYPE =="Theft from Vehicle"| TYPE == "Theft of Bicycle"| TYPE =="Theft of Vehicle"), YEAR ==2017)
## Converting back to Spatial points data frame to plot on map
yvr_theft_vech_2016 <- SpatialPointsDataFrame(coords = yvr_shp_df_2016[,c(9,10)], data = yvr_shp_df_2016, proj4string = CRS("+proj=utm +zone=10 +datum=NAD83 +units=m +no_defs +ellps=GRS80
+towgs84=0,0,0"))
yvr_theft_vech_2017 <- SpatialPointsDataFrame(coords = yvr_shp_df_2017[,c(9,10)], data = yvr_shp_df_2017, proj4string = CRS("+proj=utm +zone=10 +datum=NAD83 +units=m +no_defs +ellps=GRS80
+towgs84=0,0,0"))
map2016 <- mapview(yvr_theft_vech_2016, zcol = c("TYPE","N_HOOD"), legend =TRUE)
map2017<-mapview(yvr_theft_vech_2017, zcol = c("TYPE","N_HOOD"), legend =TRUE)
map2016